Metadata Management for Large Statistical Databases
Abstract
Data description, or metadata, presents a significant database management challenge, particularly for scientific and statistical databases. Ideally, we would like to access and manipulate data and metadata using the same DBMS tools, but few systems even begin to provide such integrated capabilities. This paper outlines a framework for more integrated metadata management by synthesizing ideas from statistical analysis, bibliographic retrieval, data dictionary, and database management systems. Drawing on experience and examples from a large statistical database project, the paper discusses and analyzes:
* general types and uses of data about data
* special types of metadata for statistical databases
* metadata structure and characteristics
* principles and requirements for metadata management
Similar resources
Statistical Computing and Databases: Distributed Computing Near the Data
This paper addresses the following question: “how do we fit statistical models efficiently with very large data sets that reside in databases?” Nowadays it is quite common to encounter a situation where a very large data set is stored in a database, yet the statistical analysis is performed with a separate piece of software such as R. Usually it does not make much sense and in some cases it ...
Improved emotion recognition with large set of statistical features
This paper presents and discusses speaker-dependent emotion recognition with a large set of statistical features. Speaker-dependent emotion recognition currently achieves the best accuracy. Recognition was performed on the English, Slovenian, Spanish, and French InterFace emotional speech databases. All databases include 9 speakers. The InterFace databases include neutral speaking s...
Scaling EM (Expectation-Maximization) Clustering to Large Databases
Practical statistical data clustering algorithms require multiple data scans to converge. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework requiring at most one scan of the database, and apply it to the Expectation-Maximization (EM) algorithm. Unlike distance-based or hard membership algorithms (such as k-Means) EM is known to be an app...
Conceptual Clustering of Heterogeneous Distributed Databases
With increasingly more databases becoming available on the Internet, there is a growing opportunity to globalise knowledge discovery and learn general patterns, rather than restricting learning to specific databases from which the rules may not be generalisable. Clustering of distributed databases facilitates learning of new concepts that characterise common features of, and differences between...
Clustering of highly homologous sequences to reduce the size of large protein databases
We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive datab...